Multi scale-aware attention for pyramid convolution network on finger vein recognition

In recent years, biometrics has been the most popular style of personal identification. The finger vein is an intrinsic and stable trait, and with the ability to detect liveness, it receives academic and industry attention. However, convolution neural networks (CNNs) based finger vein recognition generally can only cover a small input region by using small kernels. Hence, the performance is poor, facing low-quality finger vein images. It is a challenge to effectively use the critical feature of multi-scale for finger veins. In this article, we extract multi-scale features via pyramid convolution. We propose scale attention, namely, the scale-aware attention (SA) module, which enables dynamic adjustment of the weight of each scale to information aggregation. Utilize the complementation of different scale detail features to enhance the discriminativeness of extracted features, thus improving the finger vein recognition performance. In order to verify the present method’s efficiency, we carried out experiments on two public data sets and one internal data, and the wide range of experimental results proves the proposed method’s effectiveness.

joint attention (JA) to focus on the details of features, which dynamically adjusts the information aggregation in the spatial channels.
Inspired by these prior efforts and image detection, we presented an end-to-end backbone based on a novel mechanism for residual attention called "Scale-aware Attention (SA)", which focuses on multi-receptive fields based on pyramid convolutional.The structure mainly consists of two stages: the feature extraction and the SA selection stages.In particular, it exploited pyramid convolutional to obtain different scale features from the input.This single-scale feature contains different details.Then, the SA choice stage leverages dynamic adjustment of the weights of different scale features to complement each feature.Finally, the refined feature was used for finger vein verification.The major contributions of the current work are outlined below: (1) We proposed a novel end-to-end model that replaces the stand convolutional with the pyramid convolutional.The proposed network uses a convolutional pyramid to extract multi-scale features via different kernel sizes to broaden receptive fields.(2) On the basis of SA, a coarse-to-fine feature extraction architecture is suggested which allows for end-to-end training, and the weights of different scale features can adjust adaptively.Using generalized mean pooling to generate the fused component weighted at different scales.(3) We conduct extensive experiments on public and in-house finger vein datasets, demonstrating proposed method outperforms existing SOTA methods while preserving good model and efficiency.

Related work A brief review of pyramid convolution
Pyramid convolution is widely used in image classification since it uses different convolution kernels to expand the receptive field and complementary information of different scales.Wang et al. 6 present a multi-scale pyramid Convolution module with spatial attention and channel attention to pay more attention to the images' object.
Liang et al. 7 designed a model named SC2Net, which obtains the predicted density maps using residual pyramid dilated convolution (ResPyDConv).Jie et al. 8 deliver a well-traced spatial pyramid module that gathers global and local cues to solve the challenge of scale variation in deep convolutional networks.Jia et al. 9 combined pyramiddilated convolutional blocks (PDCBs) with gated fusion units to reconstruct tiny image details since the PDCBs expand receptive fields and obtain details of the images via the network.Sun et al. 10 proposed a pyramid attention mechanism for classifying the Hyperspectral Images (HIS) based octave convolution network.Bao et al. 11 propose a residual attention pyramidal convolution for expression recognition-based VGG network, which uses CBAM attention to improve the weight of crucial aspects in multi-scale features extracted by pyramidal convolution.Moreover, it has better recognition performance and lightweight characteristics compared with other state-of-the-art networks.Due to the unique characteristics of finger vein images, like only grey channels and low contrast, extended receptive fields are a practical measure, and the studies on multi-receptive fields of finger veins mentioned in the previous section are successful example.

Attention mechanism
The attention mechanism diverts more attention to essential regions and disregards irrelevant parts, imitating the human visual system.Moreover, it has exhibited excellent performance in various computer vision tasks, for instance, object detection, image classification, meta-learning, and individual recognition.Attention mechanisms can be categorized as temporal attention, spatial attention, channel attention, compound attention, and branch attention.SENet 12 is the pioneered attention channel, which, via a squeeze-and-excitation (SE) block, includes a squeeze module and excitation module to improve representation ability by adjusting the weight of the channelwise relationship.Spatial transformer networks (STN) 13 are classical spatial attention systems that use a learnable module named spatial transformer to exploit spatial manipulation.Temporal attention is widely used for video processing to dynamically select when to pay attention.Selective kernel network (SKNet) 14 is classical branch attention, which uses soft attention to be guided by the information in these branches.The convolutional block attention module (CBAM) 15 represents compound attention, which recalibrates the significance of various spatial channels and positions by rescaling.
In finger vein authentication, there is much research about attention mechanisms.The attentions usually are a transformer of spatial attention, channel attention, and CBAM.As the multi-receptive field is extracted, the attention mechanism-based SKNet should be adequate for the multi-branch network, but it takes lots of computational power not applied.A new attention mechanism is designed from the scale level in this paper.

Methodology
This section presents a new finger vein identification network with Residual scale-aware attention, namely RSAFVNet.The main task of RSAFVNet is to capture the details with a multi-scale, which captures the intense discrimination with an end-to-end structure.Figure 1 gives an overview of the structure of RSAFVNet.We designed a backboned named Residual scale-aware attention Pyramidal Convolution (RSAPyConv), which uses pyramidal convolution instead of strand convolution to extract multi-scale finger vein feature.The specific RASPyConv module block structure, as shown in Fig. 2, mainly includes the pyramidal convolution, scale-aware attention that consists of scale descriptor and scale attention block, the global descriptor, and the loss function.

Pyramidal convolution
Convolutional neural networks are the workhorse for finger vein identification.The convolutional operation is the core of deep learning, which learns spatial kernels/filters for various computer vision tasks.Single kernel size standing convolution is used in most deep learning frameworks, which limits the network's receptive field.To expand the receiving field of the kernel, multi-branch CNNs were proposed, such as the representative InceptionNet families, but the amount of calculation was increased.Duta 16 proposed a multi-scale convolutional module named PyConv.An illustration is presented in Fig. 3, which displays a pyramid with n layers of various kernels, and the kernel has different spatial sizes for each level PyConv.With increasing the level of PyConv in the PyConv pyramid, the type of kernel size increased.In conjunction with the rise in space size, the depth of the kernel drops from level 1 to level n.
A varying kernel depth is applied at each level, and the input feature maps FM i were separated into different groups.Two examples of various convolutions are displayed in Fig. 4.This example has eight output and input   feature maps.The standard convolution is illustrated in Fig. 4a, where the depth of the kernel was equal to the number of input graphs.Consequently, each output feature map was associated with all input feature maps.Figure 4b indicates that the input feature maps were classified into four groups, each with the same kernel size.As a result, the kernel depth was decreased by four.Moreover, grouped convolution could significantly decrease the computational cost and the number of convolution parameters.On this basis, we introduced the PyConv into finger vein verification.Additionally, the critical issue for feature maps at various scales is how to capture the feature response between scales at each level.

Scale-aware attention
To our knowledge, the finger vein images are remarkably similar, so the feature extracted for finger vein recognition requires high visual expression ability to address the slight difference in types of finger veins; multi-scale receptive fields were extracted in pyramidal convolution.Directly fusing the different scale maps to learn more discriminative features will drown out the valuable information in the redundant information volume.In light of the above analysis, we presented the SA to adaptively exploit complementary characteristics at various scales produced by the pyramidal convolution.The suggested SA comprises a scale attention module (SAM) and a scale descriptor module (SDM).SDM uses the different group maps to generate the scale descriptor.Meanwhile, the SAM combines the level descriptor to refine the scale features.

Scale descriptor module
The operational detail of SDM as illustration Fig. 5.The out of Pyramidal convolution includes multi-scale feature maps are fed into the SDM and the output is denoted as ∈ R 4C×H×W and = R 4C×H×W .Firstly, SDM employs a 1 × 1 convolutional layer to decrease the channel sizes.Then the SDM reshapes the to R C×N , and ∈ R N×C where N = H × W represents the input finger vein image's pixels number.And as the Pyramidal convolution uses four type of keral sizes by groups, can be represented as = [ 1 , 2 , 3 , 4 ] , can be represented as =[ 1 , 2 , 3 , 4 ] .The i and i ∈ R C 4 ×N signal as the i-th scale.After that, SDM utilizes matrix multiplication between the transpose and , which is subsequently normalized with softmax lay to produce the scale descriptors: (1)  where ij = i j .As a result, the contains the information on the details of all scales.In this way, the scale descriptors produced by SDM contain both location and multi-scale information.

Scale attention module
We argue that single-scale features lack adequate semantic or spatial detail information to gain the performance of finger vein authentication.Hence, multi-scale information in the scale descriptors can augment single-scale features to give them greater discriminative power.In particular, as illustrated in Fig. 6, given the level descriptors and F i , i ∈ [1, 2, 3, 4] , SAM adopts a matrix multiplication method to perform the perception of multi- scale information.Consequently, SAM acquires attentional scale characteristics with scale awareness.To further strengthen the feature representation, SAM applies a skip-connection approach here.

Analysis performance of RASAPyConv
RASAPyConv is the core of the design.According to the structure of Fig. 2, it can be known that during the feature extraction process, the input is first through the pyramidal convolution module, which generates multiscale feature maps.Then, it is compressed by point convolution to bring the number of channels down to the number of single-scale channels, reducing network parameters.Finally, the feature maps through the SA attention mechanism are combined for dimension expansion.SA is mostly matrix transformation, so the central computation of the RASAPyConv module is generated by the PyConv module.The PyConv module is analyzed from two aspects: the number of parameters (space performance) and the computational requirements Floating Point Operations) (FLOPs) (time performance).
Assume that the PyConv module has FM i as the number of input feature channels, the dimensions of each www.nature.com/scientificreports/ (2) High efficiency The convolution of each layer of the PyConv module can realize independent parallel calculation, even run independently on different machines, and perform feature fusion.Hence, the overall calculation efficiency is high.

Global feature descriptor
For biometrics, the authentication is to obtain a score that a feature vector extracted from the biometric traits' enrollment data compared with the obtaining.This is consistent with the image retrieval domain, which uses the global feature descriptor.Compared with the local features, it summarizes an image's content as a compact representation, and similarity learning is not limited by the object posed and the image's qualities.As those limitations also influence the performance of finger vein identification and the high similarity of local texture features, we employed global feature descriptors to produce a compact feature with multi-scale receptive fields for finger vein authentication.GeM pooling can increase the features' representability via a nonlinear learnable operator, so it has been adopted for biometrics identification tasks 17 .Here, for a specific input finger vein image, the output of the RSAPyConv layer is a 3-dimensional tensor X with C × W × H size, indicating the number of channels, weight and height of feature maps separately.In computer vision, global max pooling (MAC 18,19 ) and global average pooling (SPoC 20 ) are generalized GeM pooling, which can be described as the below formulation: In which, the pooling parameter P c is a learnable parameter, when p c → ∞ or p c = 1 , the given form is the same as MAC or SPoC separately.The configurable and shareable pooling layer parameters allow for a highly non-linear presentation to be achieved with varying pooling parameters p c .What is more, the GeM pooling operation is divisible and can be optimized via back-propagation, guaranteeing that our presented RSAFVNet can be trained end-to-end.

Joint loss function strategy
The feature representation with small intra-class and significant inter-class distances can better promote finger vein recognition.In our experiment, a joint loss function, which combines softmax classification loss with a deep metric-based loss function (Center Loss function), was applied.Since the softmax loss only forces expand the inter-class distance of different classes, ignoring the intraclass distance, the finger veins' position and the lights may lead to a significant difference of the same finger vein in the capture, so that it may cause misclassification.The center loss function has a better ability to reduce the intraclass distance strongly, so we introduce it into the network, and the calculation process is as follows: where x represents the input features, c i denotes the center of all samples that have the same class label as y i .
Therefore, the joint loss function calculation process is as follows: where λ is used to keep them in balance.And we will discuss its influence in the following experimental section.

Experimental study
This section briefly describes the finger vein datasets utilized, the two public finger vein datasets, and an internal dataset.This section also describes the metrics that were evaluated.Then, it illustrates the detailed implementation, including default settings and training data preparation.Lastly, many experiments and their contents are employed to prove the effectiveness of the method presented, which includes ablation experiments, compared to other attention methods.At the same time, the superior performance of the presented method was proved compared to other state-of-the-art methods on a public benchmark dataset.

Datasets and evaluation protocol
Datasets (a) FV-USM (USM) The FV-USM 21 dataset includes 2952 images at all, which was conducted via Universiti Sains Malaysia, George Town, Malaysia.Finger vein images were collected from the middle and index fingers of 123 people's hands, and they were gathered six times for each finger.The original images are grayscale and 640 × 480 resolution.Also, there is a resolution of 100 × 300 for the ROI provided in this dataset.(b) SDUMLA-HMT (SDUMLA) The SDUMLA 22 datasets includes 3816 images in all, which was conducted via Shandong University, Jinan, China.Finger vein images were collected from the ring, middle and index fingers of 106 people on both hands, and they were taken six times for each finger.The original images are grayscale and 320 × 240 resolution.
(4) www.nature.com/scientificreports/(c) GERWIN The GERWIN 23 datasets includes 8316 images in all, which was performed by our team.Finger vein images were collected from the ring, middle and index fingers of 106 people on both hands, and they were taken six times for each finger.The original images are grayscale and 600 × 200 resolution.

Evaluation protocol
In evaluation protocol, we adopt genuine pairs and impostor pairs, which introduced in 24 .Expressed mathematically as: where the Ng represents the number of genuine pairs, the Ni represents the number of imposto pairs.Nc represents the number of finger vein categories in test set and Nf indicates the number of samples per finger vein category.
In order to fairly assess the behavior of various methods, equal error rate (EER) and accuracy (Acc) were calculated on a subset of tests.EER value equals the value when the false acceptance rate (FAR) equals the false rejection rate (FRR).The algorithms with a lower EER demonstrated improved performance on the validation tasks.
The open-set protocol was used in our experiment to assess the behavior of finger-vein verification, which is irrelevant to the actual application since the training classes and test classes do not overlap.In particular, for each database, one-half of the classes are randomized for training, and the remainder are applied for testing.In our trials, a fourfold cross-validation method was employed.

Implementation details
The above datasets have different resolutions, with redundant information and background fixed noise.We extracted the ROI image from those original images as the method applied in 25 and resized the finger vein samples to 224 × 224 × 3 for training.The grayscale one-channel sample was expanded to a three-channel image by copying twice as input.In 26 tradition, national data augmentation can improve the performance of finger veins and prevent overfitting.In our experiment, we adopt data augmentation methods such as flip and shear, etc., the specific ones listed in Table 1, which gives greater flexibility to train samples within each training cycle and to train finger vein samples without increase.The training samples of input are transformed randomly prior to entering each training epoch.In this format, the number of training samples remains constant.
In our experiments, the parameters are designed on the basis of previous experience of finger vein identification based on deep learning.The input batch size is 128, which includes 32 subjects and four samples in each subject.Furthermore, the epoch maximum is 100.The optimizer is an Adaptive Moment Estimation with an initial learning rate of 0.001 and dynamically adjusts using the ReduceLROnPlateau function.Specifically, the learning rate decreased to the original 0.1 when the validation set loss was not reduced within 20 batches.It should be pointed out that the experiment was conducted utilizing Python with the PyTorch frame on a workspace computer that had the below specifications: Intel(R)Core(TM) i7-8700 CPU @ 3.20 GHz RAM 16 GB, and GPU NVIDIA GeForce GTX 1060 6 GB.

Parameter selection
In order to assess the parameter λ performance of the network presented, in our tests, the scope of λ is between 0.01 and 5.The mean value was applied to determine the performance of the presented method, see Table 2. ( 7)  Table 2 displays the performance of the presented method in various datasets.This indicates that the suitability of the parameter λ can improve the recognition of finger veins.For instance, in the FVUSM dataset, the optimum performance of the presented method is achieved with λ equals to 0.1, the Acc of 99.92%, and EER of 1.02%.Similarly, in the SDU datasets, the optimum performance of the presented method with the λ equal 0.5, the Acc of 98.95%, and the EER of 0.73%.As for the GERWIN dataset, the proposed method's best performance with the λ equals 2, the Acc of 99.16%, and EER of 0.23%.

Loss function comparison
To assess the effectiveness of the joint loss function strategy, softmax, and the central loss function were compared to the joint loss function.The performance of different loss functions is listed in Table 3 in detail.
As illustrated in Table 3, the properties of the joint loss function-based FV-USM dataset are compared to the softmax loss and central loss functions; the Acc increased by 1.74% and 0.52%, and the EER increased by 1.33% and 0.40%, respectively.The performance of the SDUMLA-HMT datasets based on the joint loss function compared to the softmax loss and central loss functions, the Acc increased by 3.12% and 0.93%, and the ERR increased by 3.56% and 0.58%, respectively.The performance of the GERWIN datasets based on the joint loss function compared to the softmax loss and central loss functions, the Acc increased by 1.00% and 0.29%, and the ERR increased by 3.52% and 1.45%, respectively.

Effectiveness of the pyramidal convolution
An experiment was performed on SDUMLA-HMT, FV-USM, and GERWIN datasets to validate the significance of the pyramidal convolution.The pyramidal convolution was replaced by standard convolution, which has the same number of channels.The output of the standard convolution reduces the channel to one in four and splits it into four groups, achieving the scale-aware attention mechanism with a single scale.The performance of different convolutions is listed in Table 4.As shown, the pyramidal convolution outperforms the standard convolution.

Effectiveness of the SA attention module
Ablation study of SA attention In order to investigate the properties of the proposed module of SA, an ablation study of SA was performed on datasets of SDUMLA-HMT, FV-USM, and GERWIN.Furthermore, SA attention evolved from PA attention, contrasting with PA attention.The findings of the SA ablation study are presented in Table 5.
As shown in Table 5, the performance of the FV-USM datasets with SA attention compared with no attention and PA attention increased by 0.89% and 0.46%, and EER increased by 0.52% and 0.19%, respectively.The performance of the SDUMLA-HMT datasets with SA attention compared with no attention and PA attention, the Acc increased by 1.82% and 0.08%, and ERR increased by 0.64% and 0.39%, respectively.Compared with no attention and PA attention, the performance of the GERWIN datasets with SA attention increased by 1.62% and 0.47%, and ERR increased by 1.13% and 0.17%, respectively.

Compared with different attention
To better demonstrate the effectiveness of the presented scheme, two classical attentional mechanisms, CBAM and SE, were compared with the newly presented finger-vein joint attention.The optimum values are represented  www.nature.com/scientificreports/ in bold, as illustrated in Table 6.From these results, each attention module outperforms the baseline modules with no attention.Nevertheless, the module with SA attention presents the best performance on finger vein public datasets and in-house datasets with the highest Acc and the lowest EER.These present the validity of the method proposed.Figure 7 displays the findings of the two former experiments, where performance is better as the curve gets closer to the axis, and the intersection point between the diagonal and the curve is the EER.

Comparison with SOTA methods
We assessed our presented network through a comparison with state-of-the-art finger vein authentication approaches.The tests were performed on the public datasets SDUMLA and FV-USM.The findings are displayed in Tables 7 and 8, and the bold indicates the optimum value.
As we can see from the tables, our method has a slight advantage over the method proposed in Ref. 28 , and the method of extracted feature proposed does not need prior knowledge.Fang et al. proposed a double-weighted group sparse representation classification with some finger vein details lost, while our proposed method to extract multi-scale finger vein details feature.Liu et al. 31 proposed attention by connecting a residual block with multistage residual attention.They achieved a good performance in public datasets, and Huang et al. 32 presented a new transformer based on finger vein recognition, focusing on global information.However, the proposed    method is superior to MMRAN and FVT, and the design of the finger vein model of the end-to-end multi-scale attention mechanism in this paper is also inspired by them.Overall, the proposed method has the lowest EER and the highest Acc of the compared methods, which indicates that the proposed method is effective.Although the proposed method performs significantly in finger vein authentication, some issues still need further research.Firstly, the finger vein collected will be mixed with noise in the collection process to extract more compelling features and prevent illegal people from introducing potential backdoor attack threats.Second, different types of finger vein data are collected in different usage scenarios, which requires the model to have high generalization performance.Still, future work can consider cross-validation experiments to contribute to the optimization of model parameters and improve the generalization capability.Finally, with the increase of users, the query authentication time will also increase significantly, so more convenient authentication methods are also a focus of future research.

Figure 5 .
Figure 5. Detail of scale descriptor module.

Figure 6 .
Figure 6.Details of scale attention module.
This paper presented a novel end-to-end finger vein authentication network based on the pyramidal convolution with a scale-aware attention module block.Pyramid convolution learned finger vein features at different scales, and the scale-aware captured various detailed features with multi-scale, avoiding the redundancy of finger vein features.By jointly applying the central loss and softmax loss functions, the inter-class clustering was increased, and the distance of the intraclass was reduced.The model's effectiveness was verified on the open-source finger vein dataset and our in-house dataset.The ablation experiment demonstrated the effectiveness of the pyramid convolution module, joint loss function stagey, and the scale-aware attention mechanism module on recognizing the finger vein.

Table 2 .
Performance of different loss functions in different datasets. λ

Table 3 .
Performance comparison of different loss functions.

Table 4 .
Performance comparison of different convolutions.

Table 5 .
Performance of SA.

Table 6 .
Comparisons of different attention modules.

Table 8 .
Comparisons with the state-of-art on FV-USM.